{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Outline\n",
    "\n",
    "* Collect some tweets\n",
    "* Annotate the tweets\n",
    "* Calculate the accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pprint import pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Collect some data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# we'll use data from a job that collected tweets about parenting\n",
    "with open('tweet_bodies.txt') as f:\n",
    "    tweet_bodies = [body.rstrip('\\n') for body in f]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# sanity checks\n",
    "pprint(len(tweet_bodies))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# sanity checks\n",
    "pprint(tweet_bodies[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# let's do some quick deduplication\n",
    "from duplicate_filter import duplicateFilter\n",
    "\n",
    "# set the similarity threshold at 90%\n",
    "dup_filter = duplicateFilter(0.9)\n",
    "\n",
    "deduped_tweet_bodies = []\n",
    "# use the list position as the tweet ID (avoids shadowing the built-in id)\n",
    "for tweet_id, tweet_body in enumerate(tweet_bodies):\n",
    "    if not dup_filter.isDup(tweet_id, tweet_body):\n",
    "        deduped_tweet_bodies.append(tweet_body)\n",
    "\n",
    "pprint(deduped_tweet_bodies[:10])"
   ]
  },
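  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `duplicate_filter` module is local to this project, so its internals are not shown here. As a rough illustration of what threshold-based near-duplicate filtering can look like, here is a minimal sketch built on `difflib.SequenceMatcher` from the standard library. The helper `dedupe_by_similarity` and the sample strings are hypothetical, and the real `duplicateFilter` may well use a different similarity measure.\n",
    "\n",
    "```python\n",
    "from difflib import SequenceMatcher\n",
    "\n",
    "\n",
    "def dedupe_by_similarity(texts, threshold=0.9):\n",
    "    # keep a text only if it is less than `threshold` similar to every text kept so far\n",
    "    kept = []\n",
    "    for text in texts:\n",
    "        is_dup = any(\n",
    "            SequenceMatcher(None, text, other).ratio() >= threshold\n",
    "            for other in kept\n",
    "        )\n",
    "        if not is_dup:\n",
    "            kept.append(text)\n",
    "    return kept\n",
    "\n",
    "\n",
    "# made-up sample strings, not taken from the tweet data\n",
    "sample = [\n",
    "    'Parenting is hard work',\n",
    "    'Parenting is hard work!',\n",
    "    'Toddlers never sleep',\n",
    "]\n",
    "print(dedupe_by_similarity(sample))\n",
    "```\n",
    "\n",
    "Each text is compared against every text kept so far, so the cost grows quadratically with the number of tweets. That is fine for a small sample but not for a large collection."
   ]
  },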
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Annotate the data\n",
    "\n",
    "## Start by tokenizing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from nltk.tokenize import TweetTokenizer\n",
    "tt = TweetTokenizer()\n",
    "tokenized_deduped_tweet_bodies = [tt.tokenize(body) for body in deduped_tweet_bodies]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# sanity checks\n",
    "len(tokenized_deduped_tweet_bodies)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pprint(tokenized_deduped_tweet_bodies[:2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Now tag the tokens with part-of-speech labels\n",
    "\n",
    "NLTK's default `pos_tag` configuration is the greedy averaged perceptron tagger (https://explosion.ai/blog/part-of-speech-pos-tagger-in-python)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from nltk.tag import pos_tag as pos_tagger\n",
    "tagged_tokenized_deduped_tweet_bodies = [pos_tagger(tokens) for tokens in tokenized_deduped_tweet_bodies]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pprint(tagged_tokenized_deduped_tweet_bodies[:2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# let's look at the taxonomy of tags; in our case derived from the Penn Treebank project\n",
    "# (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.8216&rep=rep1&type=pdf)\n",
    "\n",
    "import nltk\n",
    "nltk.help.upenn_tagset()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# let's peek at the tag dictionary for our tagger\n",
    "\n",
    "from nltk.tag.perceptron import PerceptronTagger\n",
    "t = PerceptronTagger()\n",
    "pprint(list(t.tagdict.items())[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluate the annotations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We must choose which parts of speech to evaluate. Let's focus on adjectives, which are useful for sentiment analysis, and proper nouns, which provide a set of potential events and topics. The relevant Penn Treebank tags are:\n",
    "\n",
    "* JJ: adjective or numeral, ordinal\n",
    "* JJR: adjective, comparative\n",
    "* JJS: adjective, superlative\n",
    "\n",
    "\n",
    "* NNP: noun, proper, singular\n",
    "* NNPS: noun, proper, plural"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "adjective_tags = ['JJ','JJR','JJS']\n",
    "pn_tags = ['NNP','NNPS']\n",
    "tag_types = [('adj',adjective_tags),('PN',pn_tags)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# print format: \"POS: TOKEN --> TWEET TEXT\"\n",
    "\n",
    "# switch to pn_tags to inspect proper nouns instead\n",
    "tags_to_show = adjective_tags\n",
    "\n",
    "for body, tagged_tokens in zip(deduped_tweet_bodies, tagged_tokenized_deduped_tweet_bodies):\n",
    "    for token, tag in tagged_tokens:\n",
    "        if tag in tags_to_show:\n",
    "            print('{}: {} --> {}'.format(tag, token, body))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These seem like dreadful results. Let's try a different NLP engine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Stanford CoreNLP\n",
    "\n",
    "Download:\n",
    "\n",
    "http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip\n",
    "\n",
    "Then unzip. Start up the server from the unzipped directory:\n",
    "\n",
    "`$ java -mx4g -cp \"*\" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000`"
   ]
  },
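  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the server started above accepts plain HTTP POST requests, which is presumably what a wrapper library sends on our behalf. The sketch below builds such a request with the standard library, assuming the server is listening on `localhost:9000`. The helper name `build_corenlp_request` and the sample sentence are hypothetical, and the lines that actually send the request are commented out because they need a running server.\n",
    "\n",
    "```python\n",
    "import json\n",
    "from urllib.parse import urlencode\n",
    "from urllib.request import Request\n",
    "\n",
    "\n",
    "def build_corenlp_request(text, url='http://localhost:9000'):\n",
    "    # the annotators and output format travel as a JSON-encoded query parameter\n",
    "    properties = {'annotators': 'tokenize,ssplit,pos', 'outputFormat': 'json'}\n",
    "    query = urlencode({'properties': json.dumps(properties)})\n",
    "    # the text to annotate is the UTF-8 encoded request body\n",
    "    return Request(url + '/?' + query, data=text.encode('utf-8'), method='POST')\n",
    "\n",
    "\n",
    "request = build_corenlp_request('Bedtime is the best time')\n",
    "print(request.get_method(), request.full_url)\n",
    "\n",
    "# with the server running, the response could be read like this:\n",
    "# from urllib.request import urlopen\n",
    "# with urlopen(request) as response:\n",
    "#     result = json.loads(response.read().decode('utf-8'))\n",
    "```"
   ]
  },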
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from corenlp_pywrap import pywrap\n",
    "cn = pywrap.CoreNLP(url='http://localhost:9000', annotator_list=[\"pos\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "corenlp_results = []\n",
    "\n",
    "for tweet_body in deduped_tweet_bodies:\n",
    "    try:\n",
    "        corenlp_results.append(cn.basic(tweet_body, out_format='json').json())\n",
    "    except UnicodeEncodeError:\n",
    "        # keep an empty placeholder so results stay aligned with deduped_tweet_bodies\n",
    "        corenlp_results.append({'sentences': []})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# pull out the tokens and tags\n",
    "corenlp_tagged_tokenized_deduped_tweet_bodies = [\n",
    "    [(token['word'], token['pos']) for sentence in result['sentences'] for token in sentence['tokens']]\n",
    "    for result in corenlp_results\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# print format: \"POS: TOKEN --> TWEET TEXT\"\n",
    "\n",
    "for body,tagged_tokens in zip(deduped_tweet_bodies,corenlp_tagged_tokenized_deduped_tweet_bodies):\n",
    "    for token,tag in tagged_tokens:\n",
    "        #if tag in pn_tags:\n",
    "        if tag in adjective_tags:\n",
    "            print_str = '{}: {} --> {}'.format(tag,token,body)\n",
    "            print(print_str)"
   ]
  },
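  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scrolling through two long printouts makes it hard to see where the taggers disagree. One possible way to line them up, sketched below, is to compare the sets of tokens that each tagger assigns to a tag group for the same tweet. Comparing sets rather than positions sidesteps the fact that the two engines tokenize differently. The function names and the toy tagged tokens are hypothetical and are not output from either tagger.\n",
    "\n",
    "```python\n",
    "def tokens_with_tags(tagged_tokens, target_tags):\n",
    "    # lower-cased tokens whose tag is in target_tags\n",
    "    return {token.lower() for token, tag in tagged_tokens if tag in target_tags}\n",
    "\n",
    "\n",
    "def compare_taggers(tagged_a, tagged_b, target_tags):\n",
    "    # split the selected tokens into those found by both taggers and by only one\n",
    "    a = tokens_with_tags(tagged_a, target_tags)\n",
    "    b = tokens_with_tags(tagged_b, target_tags)\n",
    "    return {'both': a & b, 'only_a': a - b, 'only_b': b - a}\n",
    "\n",
    "\n",
    "# toy input: made-up (token, tag) pairs for a single tweet\n",
    "tagger_a_output = [('so', 'RB'), ('tired', 'JJ'), ('lol', 'JJ')]\n",
    "tagger_b_output = [('so', 'RB'), ('tired', 'JJ'), ('lol', 'UH')]\n",
    "\n",
    "comparison = compare_taggers(tagger_a_output, tagger_b_output, ['JJ', 'JJR', 'JJS'])\n",
    "print(comparison)\n",
    "```\n",
    "\n",
    "In this notebook, `tagged_a` and `tagged_b` would be corresponding entries of `tagged_tokenized_deduped_tweet_bodies` and `corenlp_tagged_tokenized_deduped_tweet_bodies`."
   ]
  },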
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Conclusions and next steps\n",
    "\n",
    "For tweet bodies, judging by manual inspection:\n",
    "\n",
    "* NLTK's `TweetTokenizer` is pretty good\n",
    "* NLTK's default POS tagger is dreadful\n",
    "* CoreNLP's POS tagger is better than NLTK's\n",
    "\n",
    "Next steps:\n",
    "\n",
    "* Make more careful accuracy measurements\n",
    "* Get a better tokenizer into CoreNLP\n",
    "* Look at other POS taggers\n",
    "* Compare other tags, such as sentiment"
   ]
  },
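  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a starting point for the more careful accuracy measurements, one option is to hand-label a sample of tokens and compute precision and recall for a tag group. The sketch below assumes the predicted and hand-labeled tags are aligned token by token. The function names are hypothetical, and the tags in the example are made-up placeholders, not measurements from this notebook.\n",
    "\n",
    "```python\n",
    "def tag_precision(predicted, gold, target_tags):\n",
    "    # of the tokens the tagger put in target_tags, the fraction the hand labels agree with\n",
    "    gold_for_flagged = [g for p, g in zip(predicted, gold) if p in target_tags]\n",
    "    if not gold_for_flagged:\n",
    "        return None\n",
    "    return sum(g in target_tags for g in gold_for_flagged) / len(gold_for_flagged)\n",
    "\n",
    "\n",
    "def tag_recall(predicted, gold, target_tags):\n",
    "    # of the tokens hand-labeled as target_tags, the fraction the tagger found\n",
    "    predicted_for_relevant = [p for p, g in zip(predicted, gold) if g in target_tags]\n",
    "    if not predicted_for_relevant:\n",
    "        return None\n",
    "    return sum(p in target_tags for p in predicted_for_relevant) / len(predicted_for_relevant)\n",
    "\n",
    "\n",
    "# placeholder tags for five tokens, not real tagger output\n",
    "predicted_tags = ['JJ', 'NN', 'JJ', 'VB', 'JJR']\n",
    "hand_labels = ['JJ', 'NN', 'NN', 'VB', 'JJR']\n",
    "adjective_tags = ['JJ', 'JJR', 'JJS']\n",
    "\n",
    "precision = tag_precision(predicted_tags, hand_labels, adjective_tags)\n",
    "recall = tag_recall(predicted_tags, hand_labels, adjective_tags)\n",
    "print(precision, recall)\n",
    "```\n",
    "\n",
    "Both functions return `None` when there is nothing to score, which keeps an empty sample from being reported as a perfect or a zero result."
   ]
  },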
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
